This dataset was compiled from records of meteorite landings spanning hundreds of years. A meteoroid is an extraterrestrial object that enters Earth’s atmosphere. As it passes through the atmosphere it leaves a glowing trail, called a meteor and popularly known as a ‘shooting star’. A meteorite is a fragment that survives the passage and reaches Earth’s surface, sometimes creating an impact crater. The dataset consists of meteorite names (based on location), recorded class, mass in grams, whether it fell or was found, year of occurrence, recorded latitude, and recorded longitude.
The objective of the class project is to analyze the dataset and present the findings to the reader. This is done by examining both categorical and numerical variables, looking at the distribution of the data, applying the Central Limit Theorem, and applying several sampling methods.
As the image below shows, meteorites are classified at multiple levels. At the highest level, meteorites are either Undifferentiated or Differentiated, and the majority are Undifferentiated. Undifferentiated meteorites are divided into four categories: Carbonaceous, Ordinary, Rumuruti (R), and Enstatite. Of these categories, the majority are Ordinary meteorites, which are further divided into H Meteorite, L Meteorite, and LL Meteorite. The data will be wrangled so that it can be examined across these three levels of classification.
Image Source: Clipartkey - Meteorite Classification Chart.
Using tidyverse, new columns will be created from the original column ‘recclass’, one for each of the three classification levels. Meteorites that do not belong to a class at a given level will have an NA value in that column. A final column, ‘Lowest_Class’, will then record each meteorite’s lowest classification level.
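The wrangling step described above could look roughly like the following sketch. Only ‘recclass’ and ‘Lowest_Class’ are names taken from the text; the `Level1`, `Level2`, and `Level3` column names, the toy rows, and the simplified regular expressions are assumptions for illustration and would need extending to cover every value in the real ‘recclass’ column.

```r
library(dplyr)

# Toy rows standing in for the real data (hypothetical values).
meteorites <- data.frame(
  recclass = c("H5", "L6", "LL4", "CM2", "EH4", "R3.8", "Iron, IIAB", "Howardite"),
  stringsAsFactors = FALSE
)

classified <- meteorites %>%
  mutate(
    # Level 3: subtypes of Ordinary meteorites. Requiring a digit after the
    # letters keeps names such as "Howardite" from being read as H type.
    Level3 = case_when(
      grepl("^LL[0-9]", recclass) ~ "LL Meteorite",
      grepl("^L[0-9]",  recclass) ~ "L Meteorite",
      grepl("^H[0-9]",  recclass) ~ "H Meteorite",
      TRUE                        ~ NA_character_
    ),
    # Level 2: the four Undifferentiated categories.
    Level2 = case_when(
      !is.na(Level3)                  ~ "Ordinary",
      grepl("^C[IMOVKRHB]", recclass) ~ "Carbonaceous",
      grepl("^E[HL]?[0-9]", recclass) ~ "Enstatite",
      grepl("^R[0-9]",      recclass) ~ "Rumuruti (R)",
      TRUE                            ~ NA_character_
    ),
    # Level 1: anything not matched above is treated as Differentiated.
    Level1 = if_else(is.na(Level2), "Differentiated", "Undifferentiated"),
    # Lowest available level: Level3 if present, else Level2, else Level1.
    Lowest_Class = coalesce(Level3, Level2, Level1)
  )

print(classified)
```

Using `coalesce()` for the final column means the lowest non-missing level is picked automatically, so Ordinary meteorites report their H, L, or LL subtype while Differentiated meteorites fall back to the top level.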
The dataset includes the year recorded for each meteorite landing. A barplot of frequency by year summarizes this concisely. The y-axis is logarithmically scaled to provide a cleaner plot.
A stacked barplot that breaks the meteorites down to the next classification level gives insight into trends in meteorite landings.
Further breaking down Ordinary meteorites to their lowest type reveals more trends. Of note, the years with the largest frequency spikes seem to coincide with a high number of LL type meteorites.
The dataset also records whether each meteorite fell or was found. A histogram of the probability of ‘Fell’ vs ‘Found’ across classification levels shows which outcome is more likely depending on classification.
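The ‘Fell’ vs ‘Found’ probabilities within each class are row-wise proportions of a two-way count table. A minimal sketch in base R follows; the column name `fall` and the toy rows are assumptions, not values from the dataset.

```r
# Toy data (hypothetical): classification and whether the meteorite fell or was found.
toy <- data.frame(
  Lowest_Class = c("L Meteorite", "L Meteorite",
                   "LL Meteorite", "LL Meteorite", "LL Meteorite", "LL Meteorite",
                   "H Meteorite", "H Meteorite"),
  fall = c("Fell", "Found",
           "Found", "Found", "Found", "Fell",
           "Found", "Found")
)

# Two-way table of counts: rows are classes, columns are Fell / Found.
counts <- table(toy$Lowest_Class, toy$fall)

# margin = 1 normalizes each row, giving P(Fell | class) and P(Found | class).
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

Normalizing by row rather than over the whole table is what makes classes of very different sizes comparable.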
Further lowering the classification groupings, the following is a breakdown of ‘Fell’ vs ‘Found’ across all Undifferentiated meteorites by types Carbonaceous, Enstatite, Ordinary, and Rumuruti (R).
Lastly, this histogram examines ‘Fell’ vs ‘Found’ for Ordinary type meteorites. While L type meteorites had an almost even split between ‘Fell’ and ‘Found’, LL type meteorites were overwhelmingly found rather than observed falling. Almost 90% of LL type meteorites were found after impact rather than seen falling.
These two pie charts break down the percentages of lowest classification level for meteorites depending on whether they fell or were found. Of note, LL type meteorites are only the fourth largest slice for ‘Fell’, while they are the largest slice for ‘Found’.
The dataset also includes the recorded mass of each meteorite in grams. The boxplots below show the five-number summary of mass, along with outliers. The first covers all meteorites and the two highest classification levels; the second covers Ordinary meteorites and their subtypes. The y-axis is logarithmically scaled to provide a cleaner view of the data, given that the vast majority of meteorites are small, with a number of very large outliers. Two points stand out in the first boxplot. First, the Q1, median, and Q3 values are higher for Differentiated meteorites than for Undifferentiated meteorites. Second, the largest meteorite by mass, which is more than five times heavier than the second heaviest, is a Differentiated meteorite.
Two aspects stand out in the second boxplot. First, the Q1, median, and Q3 values for LL type meteorites are drastically smaller than those of the other Ordinary types or of Ordinary meteorites as a whole. Second, the two largest meteorites by mass in this plot are H type meteorites, both significantly heavier than the heaviest L type or LL type meteorite.
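The quantities a boxplot draws can be computed directly. The sketch below uses a small made-up mass vector (an assumption, not the meteorite data) to show the five-number summary, the points flagged as outliers, and the effect of a log scale.

```r
# Hypothetical masses in grams: mostly small values plus one extreme value.
mass <- c(1, 2, 4, 8, 16, 32, 1e6)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(mass)
print(five)

# Points beyond 1.5 times the hinge spread, which a boxplot draws individually.
outliers <- boxplot.stats(mass)$out
print(outliers)

# On a log10 scale the extreme value no longer flattens the rest of the data.
print(fivenum(log10(mass)))
```

Because the logarithm preserves order, the median on the log scale is simply the log of the raw median; only the visual spacing changes.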
A scatterplot of mass by year, colored by lowest classification type, shows how meteorite mass varies over time. The first scatterplot shows the mass of each individual meteorite, logarithmically scaled, by type and year.
The next scatterplot shows the aggregated maximum mass for each type per year. A general downward trend is visible for Enstatite type meteorites. Considering the relative rarity of the Enstatite type, this may indicate that modern technology can identify much smaller meteorites than the technology of past generations.
This scatterplot shows the aggregated minimum mass for each type per year. There appears to be a slight downward trend in the data. As mentioned previously, this most likely reflects advances in detecting even the smallest meteorites.
This scatterplot shows the aggregated mean mass for each meteorite type by year.
Lastly, this scatterplot shows the aggregated standard deviation of mass for each meteorite type by year. Because a standard deviation requires at least two observations, a type appears in a given year only if two or more of its meteorites were recorded that year. A point of interest is that prior to 1914, only one year had recordings of two or more LL type meteorites.
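The per-year, per-type aggregates behind these scatterplots can be sketched with a grouped summary. The toy rows and the output column names (`max_mass`, `sd_mass`, and so on) are assumptions for illustration.

```r
library(dplyr)

# Toy data (hypothetical): a few meteorites across two years and two types.
toy <- data.frame(
  year = c(1900, 1900, 1900, 1950, 1950, 1950),
  Lowest_Class = c("H Meteorite", "H Meteorite", "LL Meteorite",
                   "H Meteorite", "LL Meteorite", "LL Meteorite"),
  mass = c(100, 300, 50, 20, 5, 15)
)

yearly <- toy %>%
  group_by(year, Lowest_Class) %>%
  summarise(
    n         = n(),
    max_mass  = max(mass),
    min_mass  = min(mass),
    mean_mass = mean(mass),
    sd_mass   = sd(mass),   # NA when a group has a single observation
    .groups   = "drop"
  )

print(yearly)
```

Groups with only one meteorite get an `NA` standard deviation, which is why sparse types drop out of a standard deviation plot in early years.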
This bar chart shows the mean and median mass for each lowest classification type. The y-axis is logarithmically scaled. As a rule of thumb, data is skewed right if the mean is greater than the median and skewed left if the mean is less than the median. For all types except Rumuruti (R), the mean is vastly greater than the median, which suggests a heavy right skew.
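The mean versus median comparison can be illustrated on synthetic data. The sketch below draws log-normal masses for two hypothetical groups (an assumption, chosen only because the log-normal distribution is right-skewed) and compares the two statistics per group.

```r
set.seed(1)

# Synthetic right-skewed masses for two hypothetical groups.
toy <- data.frame(
  Lowest_Class = rep(c("H Meteorite", "L Meteorite"), each = 500),
  mass = c(rlnorm(500, meanlog = 4, sdlog = 2),
           rlnorm(500, meanlog = 5, sdlog = 2))
)

means   <- tapply(toy$mass, toy$Lowest_Class, mean)
medians <- tapply(toy$mass, toy$Lowest_Class, median)

# A ratio well above 1 points to a long right tail.
skew_check <- data.frame(
  mean = means,
  median = medians,
  mean_over_median = means / medians
)
print(skew_check)
```

A handful of very large values pulls the mean upward while leaving the median almost unchanged, which is the pattern the bar chart displays.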
The plot below shows density curves of mass, logarithmically scaled, for each lowest classification level. Although there are only five Rumuruti (R) data points in the dataset, the shape of that curve is notable, as are the curves for LL type and H type meteorites compared to the other density curves.
The dataset also includes the recorded latitude and longitude of each meteorite landing. The following map displays all meteorite impact locations, color-coded by lowest classification level, with point size scaled by mass. Hovering over a data point on the map displays the name of the meteorite, its mass, and the year of impact, coded by color.
The following two histograms show the distribution of mass for all meteorites with a recorded mass value. Both show that the data is heavily skewed to the right, with a large number of outliers including a single massive one. The first is a probability histogram.
The second histogram shows density, with mass logarithmically scaled to provide a clearer view of the curve. While density peaks around 10 grams, it does not decline smoothly from there: there is a noticeable bump around 1000 grams, where density temporarily increases before falling once again.
The Central Limit Theorem states that the means of independent samples drawn from a population with finite variance are approximately normally distributed when the sample size is sufficiently large, even if the population itself does not follow a normal distribution. The mean of the sample means also approximates the population mean. Furthermore, the standard deviation of the sample means is approximately equal to the standard deviation of the population divided by the square root of the sample size. In the following example, the mean and standard deviation of the sample means for samples of size 25, 50, 75, 100, 125, and 150 will be analyzed.
## Sample Size = 25 Mean = 23828.23 SD = 88375.01
## Sample Size = 50 Mean = 23349.81 SD = 62559.92
## Sample Size = 75 Mean = 22656.07 SD = 50621.17
## Sample Size = 100 Mean = 22957.99 SD = 42271
## Sample Size = 125 Mean = 24963.18 SD = 43324.17
## Sample Size = 150 Mean = 25352.45 SD = 39036.73
## [1] "Population Mean: 23766.27"
## [1] "Population SD: 457823.43"
## [1] "CLM Estimate given Sample Size of 25 is: 91564.69"
## [2] "CLM Estimate given Sample Size of 50 is: 64746.01"
## [3] "CLM Estimate given Sample Size of 75 is: 52864.9"
## [4] "CLM Estimate given Sample Size of 100 is: 45782.34"
## [5] "CLM Estimate given Sample Size of 125 is: 40948.97"
## [6] "CLM Estimate given Sample Size of 150 is: 37381.13"
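The estimates printed above are the population standard deviation divided by the square root of each sample size. The sketch below first reproduces that arithmetic from the reported population SD, then repeats the experiment on a synthetic right-skewed population. The log-normal population, the number of repeated samples (`n_samples`), and sampling with replacement are all assumptions; this is not the code that produced the output above.

```r
set.seed(123)

# Arithmetic behind the printed estimates: population SD / sqrt(n).
pop_sd_reported <- 457823.43
sizes <- c(25, 50, 75, 100, 125, 150)
clt_estimates <- round(pop_sd_reported / sqrt(sizes), 2)
print(clt_estimates)

# Synthetic right-skewed population (not the meteorite data).
population <- rlnorm(10000, meanlog = 0, sdlog = 1)
n_samples <- 2000  # hypothetical number of repeated samples per size

# Draw many samples of size n and keep each sample's mean.
simulate_means <- function(n) {
  replicate(n_samples, mean(sample(population, n, replace = TRUE)))
}

results <- t(sapply(sizes, function(n) {
  m <- simulate_means(n)
  c(n = n,
    mean_of_means = mean(m),               # should sit near mean(population)
    sd_of_means   = sd(m),                 # observed spread of the sample means
    clt_sd        = sd(population) / sqrt(n))  # spread predicted by the theorem
}))
print(round(results, 4))
```

The observed spread of the sample means should track the predicted value more closely as the number of repeated samples grows; with a heavily skewed population, small sample sizes converge more slowly.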
The table below gives the proportion of meteorites in each lowest classification level for the full population.
##
##   Carbonaceous Differentiated      Enstatite    H Meteorite    L Meteorite
##           0.02           0.10           0.01           0.34           0.26
##   LL Meteorite   Rumuruti (R)
##           0.27           0.00
Below is the distribution by lowest classification level for a sample of size 125 drawn by simple random sampling without replacement.
##
##   Carbonaceous Differentiated      Enstatite    H Meteorite    L Meteorite
##          0.024          0.104          0.008          0.288          0.288
##   LL Meteorite
##          0.288
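Simple random sampling without replacement gives every row the same chance of selection and never picks a row twice. A minimal base R sketch on a synthetic population follows; the population, its class proportions, and the exponential masses are assumptions for illustration.

```r
set.seed(42)

# Synthetic population (hypothetical): a class label and a mass per row.
N <- 5000
pop <- data.frame(
  Lowest_Class = sample(c("H Meteorite", "L Meteorite", "LL Meteorite", "Differentiated"),
                        N, replace = TRUE, prob = c(0.35, 0.27, 0.27, 0.11)),
  mass = rexp(N, rate = 1 / 1000)
)

n <- 125
# Each row index can be drawn at most once.
srs_rows <- sample(nrow(pop), size = n, replace = FALSE)
srs <- pop[srs_rows, ]

# Proportion of each class in the sample, and the sample mean of mass.
srs_props <- prop.table(table(srs$Lowest_Class))
print(round(srs_props, 3))
print(mean(srs$mass))
```

With a sample this small, a very rare class can easily be absent altogether, which is consistent with Rumuruti (R) not appearing in the table above.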
This table shows the distribution by lowest classification level for a sample of size 125 drawn by systematic sampling.
##
##   Carbonaceous Differentiated    H Meteorite    L Meteorite   LL Meteorite
##          0.016          0.120          0.320          0.264          0.248
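Systematic sampling picks a random starting row and then takes every k-th row after it. The sketch below is one common variant, assuming the interval is `floor(N / n)` so that every selected index stays inside the data; the synthetic population is also an assumption.

```r
set.seed(7)

# Synthetic population (hypothetical).
N <- 1000
n <- 125
pop <- data.frame(id = seq_len(N), mass = rexp(N, rate = 1 / 1000))

k <- floor(N / n)          # sampling interval
start <- sample.int(k, 1)  # random start between 1 and k

# Every k-th row from the start; the largest index is at most n * k <= N.
rows <- seq(from = start, by = k, length.out = n)
sys_sample <- pop[rows, ]

print(head(rows))
print(mean(sys_sample$mass))
```

Rounding the interval up instead would risk indices past the last row, which show up as rows of missing values if they are not filtered out.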
The following table shows the distribution by lowest classification level for a sample drawn by systematic sampling with unequal probabilities, where inclusion was weighted by mass. A sample size of 125 was used.
##
##   Carbonaceous Differentiated      Enstatite    H Meteorite    L Meteorite
##          0.032          0.328          0.016          0.256          0.288
##   LL Meteorite
##          0.104
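In systematic sampling with unequal probabilities, each unit gets an inclusion probability proportional to a size measure (here mass), and units are selected where evenly spaced points land on the cumulative sum of those probabilities. The base R sketch below assumes every inclusion probability is below 1, which holds for this synthetic population but would need extra handling for data with extreme values.

```r
set.seed(2024)

# Synthetic masses (hypothetical); the real data is far more skewed.
N <- 5000
n <- 125
mass <- rexp(N, rate = 1 / 1000)

# Inclusion probabilities proportional to mass, summing to n.
pik <- n * mass / sum(mass)
stopifnot(all(pik < 1))  # assumption of this sketch: no unit is certain to be selected

# Lay the probabilities end to end on [0, n] and drop points at u, u+1, ..., u+n-1.
u <- runif(1)
upper <- cumsum(pik)
lower <- c(0, head(upper, -1))

# A unit is selected when one of the points falls inside its interval.
selected <- which(ceiling(upper - u) > ceiling(lower - u))

print(length(selected))
print(c(sample_mean = mean(mass[selected]), population_mean = mean(mass)))
```

Because heavier units occupy longer intervals, they are selected more often, so the unweighted sample mean of mass is expected to exceed the population mean.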
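The last table is broken down by lowest classification level and was created using stratified sampling with a sample size of 125. Unlike the previous tables, it reports counts rather than proportions; the counts shown sum to 123, possibly because the per-stratum allocations were rounded.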
##
##   Carbonaceous Differentiated      Enstatite    H Meteorite    L Meteorite
##              3             11              1             42             32
##   LL Meteorite
##             34
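Stratified sampling draws a separate random sample within each class. The sketch below assumes proportional allocation, where each stratum receives a share of the sample equal to its share of the population, rounded to a whole number; the stratum sizes are made up for illustration.

```r
set.seed(1)

# Synthetic population with fixed, hypothetical stratum sizes (total 1000).
classes <- c("H Meteorite", "L Meteorite", "LL Meteorite",
             "Differentiated", "Carbonaceous", "Enstatite")
pop <- data.frame(
  Lowest_Class = rep(classes, times = c(338, 262, 270, 98, 22, 10)),
  mass = rexp(1000, rate = 1 / 1000)
)

n <- 125
counts <- table(pop$Lowest_Class)

# Proportional allocation; rounding means the total is not guaranteed to equal n.
sizes <- setNames(as.integer(round(n * as.vector(counts) / nrow(pop))), names(counts))

# Sample the allocated number of rows without replacement inside each stratum.
idx <- unlist(lapply(names(sizes), function(cls) {
  rows <- which(pop$Lowest_Class == cls)
  rows[sample.int(length(rows), sizes[[cls]])]
}))
strat_sample <- pop[idx, ]

print(table(strat_sample$Lowest_Class))
print(sum(sizes))
```

Every stratum with a nonzero allocation is guaranteed to appear in the sample, while a stratum whose allocation rounds to zero is left out entirely.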
Comparing the population mean against the various sample means, the closest mean in absolute terms came from simple random sampling without replacement, although it was only about half the population mean. The means from simple random sampling and systematic sampling were both lower than the population mean, while the mean from stratified sampling was noticeably larger. The mean from systematic sampling with unequal probabilities was drastically higher, which makes sense given the huge rightward skew of the data by mass and the fact that inclusion was weighted by mass.
## [1] "Mean of Data from Simple Random Sampling Without Replacement is: 12297.35"
## [2] "Mean of Data from Systematic Sampling is: 9743.67"
## [3] "Mean of Data from Systematic Sampling with Unequal Probabilities is: 455141.75"
## [4] "Mean of Data from Stratified Sampling is: 38227.68"
## [5] "Mean of Data from Population is: 23766.27"
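The gap between each sample mean and the population mean can be checked directly from the figures printed above. The short labels used as vector names are chosen here for readability.

```r
# Means copied from the printed output above.
pop_mean <- 23766.27
sample_means <- c(
  "Simple random (without replacement)" = 12297.35,
  "Systematic"                          = 9743.67,
  "Systematic, unequal probabilities"   = 455141.75,
  "Stratified"                          = 38227.68
)

# Absolute distance from the population mean, and ratio to it.
comparison <- data.frame(
  sample_mean = sample_means,
  abs_gap     = abs(sample_means - pop_mean),
  ratio       = round(sample_means / pop_mean, 2)
)

# Sort from closest to farthest.
comparison <- comparison[order(comparison$abs_gap), ]
print(comparison)
```

Sorting by absolute gap makes the ranking explicit, and the ratio column shows the direction of each miss: values below 1 underestimate the population mean and values above 1 overestimate it.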